Statements
Executive Summary
1. Introduction
2. Aim and Methodology of this Project
3. Exploratory Data Analysis (EDA)
4. Machine Learning Model
5. Results & Discussion
6. Conclusion
Acknowledgement
I sincerely thank my parents and family for giving me the support and
opportunity to invest my time in learning Machine Learning and
Artificial Intelligence to apply in environmental management work.
Furthermore, I thank the Google Career Certification courses for
providing me the resources to learn Python programming and the
Machine Learning concepts.
Use of generative artificial intelligence
Generative artificial intelligence (GenAI) was mainly used for creating
charts and adjusting visualization parameters in Python. GenAI was
also used for code debugging. However, the responses provided by GenAI
were critically judged before being implemented.
Problem Statement
Salifort Motors is a fictional French-based alternative energy vehicle
manufacturer. The HR department at Salifort Motors wants to take some
initiatives to improve employee satisfaction levels at the company. They
turn to you as a data analytics professional and ask you to provide
data-driven suggestions based on your understanding of the data. They
have the following question: what is likely to make an employee leave
the company?
Because it is time-consuming and expensive to find, interview, and hire new employees, increasing employee retention will benefit the company. If the data analyst can predict which employees are likely to quit, it may be possible to identify the main factors that contribute to their leaving.
Project Aim and Focus
Goals in this project are to analyze the data collected by the HR
department and to build a model that predicts whether or not an employee
will leave the company.
Raw data used
This project uses a dataset called HR_capstone_dataset.csv. It
represents 10 columns of self-reported information from employees of a
fictitious multinational vehicle manufacturing corporation.
Methodology
The following methodology was undertaken for this project:
- The raw data, HR_capstone_dataset.csv from the HR department, is used to
assess the needs of the senior leadership team.
- The cleaned data set is split into 75% training and 25% test data, which
are used to train the machine learning models and to evaluate their predictions.
- Analyses such as the confusion matrix, feature importance and scoring
metrics are performed to assess the models' performance in predicting
whether an employee will leave and to identify the main factors influencing
employees to quit.
Results
Out of the models, .
Salifort Motors is a fictional French-based alternative energy vehicle manufacturer. Its global workforce of over 100,000 employees research, design, construct, validate, and distribute electric, solar, algae, and hydrogen-based vehicles. Salifort’s end-to-end vertical integration model has made it a global leader at the intersection of alternative energy and automobiles.
The HR department at Salifort Motors wants to take some initiatives to improve employee satisfaction levels at the company. They collected data from employees, but now they don’t know what to do with it. They turn to the data analytics professional and ask for data-driven suggestions based on the analyst’s understanding of the data. They have the following question: what is likely to make an employee leave the company?
Because it is time-consuming and expensive to find, interview, and hire new employees, increasing employee retention will benefit the company. If the data analyst can predict which employees are likely to quit, it may be possible to identify the main factors that contribute to their leaving.
For this project, the key stakeholders include the HR department and the senior leadership team, as they are directly involved in employee management and decision-making. The senior leadership team has tasked the data analyst with analyzing the dataset to come up with ideas for how to increase employee retention. To help with this, they would like the analyst to build a machine learning model that predicts whether an employee will leave the company based on their department, number of projects, average monthly hours, and any other data points deemed helpful.
Goals
The primary objective is to identify and predict the underlying drivers
contributing to employee turnover, which can help in formulating
effective retention strategies. Goals in this project are to analyze the
data collected by the HR department and to build a model that predicts
whether or not an employee will leave the company.
Methodology
For this project, the analyst chooses a method to approach this data
challenge, either selecting a regression model or a tree-based machine
learning model to predict whether an employee will leave the company.
The following methodology was undertaken for this project.
The raw data, HR_capstone_dataset.csv from the HR department, is used to assess the needs of the senior leadership team. The dataset was downloaded from the Kaggle website. In the EDA, the dataset is analysed and prepared for building the machine learning models, beginning with:
- Loading the required packages and the data set
First, the libraries and packages needed for the project are loaded. The selected libraries provide functions for handling data, building and running machine learning models, and visualizing results.
# Import packages
# Operational Packages
import numpy as np
import pandas as pd
import io
import pickle
# Visualization packages
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display
from tabulate import tabulate
# Modelling packages
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
#XGBoost
from xgboost import XGBClassifier
from xgboost import XGBRegressor
from xgboost import plot_importance
# Modelling evaluation and metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score,\
f1_score, confusion_matrix, ConfusionMatrixDisplay, classification_report
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.tree import plot_tree
To start the project, the dataset HR_capstone_dataset.csv is loaded and its
basic properties are examined. The dataset contains 10 columns of self-reported information
from employees of a fictitious multinational vehicle manufacturing
corporation.
# Load dataset into a dataframe
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
# Load CSV
df0 = pd.read_csv(r"D:\Study\Machine Learning\Projects\R-Git\Completed projects for GitHub\Predicting-the-employee-satisfaction-levels-at-Salifort-Motors\Data\HR_capstone_dataset.csv")
# Format first 5 rows like a kable table
print(tabulate(df0.head(), headers='keys', tablefmt='latex'))
## \begin{tabular}{rrrrrrrrrll}
## \hline
## & satisfaction\_level & last\_evaluation & number\_project & average\_montly\_hours & time\_spend\_company & Work\_accident & left & promotion\_last\_5years & Department & salary \\
## \hline
## 0 & 0.38 & 0.53 & 2 & 157 & 3 & 0 & 1 & 0 & sales & low \\
## 1 & 0.8 & 0.86 & 5 & 262 & 6 & 0 & 1 & 0 & sales & medium \\
## 2 & 0.11 & 0.88 & 7 & 272 & 4 & 0 & 1 & 0 & sales & medium \\
## 3 & 0.72 & 0.87 & 5 & 223 & 5 & 0 & 1 & 0 & sales & low \\
## 4 & 0.37 & 0.52 & 2 & 159 & 3 & 0 & 1 & 0 & sales & low \\
## \hline
## \end{tabular}
In this step, gaining a comprehensive understanding of the data set and preparing it for modelling is essential. This involves reviewing all variables to understand their data types, statistical distributions, and relevance to the target objective.
# Gather basic information about the data
# Create a StringIO buffer
buffer = io.StringIO()
# Capture the output of df.info() into the buffer
df0.info(buf=buffer)
# Get the content from the buffer
info_str = buffer.getvalue()
# Print the content
print(info_str)
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 14999 entries, 0 to 14998
## Data columns (total 10 columns):
##  #   Column                 Non-Null Count  Dtype
## ---  ------                 --------------  -----
##  0   satisfaction_level     14999 non-null  float64
##  1   last_evaluation        14999 non-null  float64
##  2   number_project         14999 non-null  int64
##  3   average_montly_hours   14999 non-null  int64
##  4   time_spend_company     14999 non-null  int64
##  5   Work_accident          14999 non-null  int64
##  6   left                   14999 non-null  int64
##  7   promotion_last_5years  14999 non-null  int64
##  8   Department             14999 non-null  object
##  9   salary                 14999 non-null  object
## dtypes: float64(2), int64(6), object(2)
## memory usage: 1.1+ MB
# Print the descriptive statistics
print(tabulate(df0.describe(), headers='keys', tablefmt='simple'))
##          satisfaction_level    last_evaluation    number_project    average_montly_hours    time_spend_company    Work_accident          left    promotion_last_5years
## -----  --------------------  -----------------  ----------------  ----------------------  --------------------  ---------------  ------------  -----------------------
## count                 14999              14999             14999                   14999                 14999            14999         14999                    14999
## mean               0.612834           0.716102           3.80305                  201.05               3.49823          0.14461      0.238083                0.0212681
## std                0.248631           0.171169           1.23259                 49.9431               1.46014         0.351719      0.425924                 0.144281
## min                    0.09               0.36                 2                      96                     2                0             0                        0
## 25%                    0.44               0.56                 3                     156                     3                0             0                        0
## 50%                    0.64               0.72                 4                     200                     3                0             0                        0
## 75%                    0.82               0.87                 5                     245                     4                0             0                        0
## max                       1                  1                 7                     310                    10                1             1                        1
The HR_capstone_dataset.csv dataset contains 14,999 rows
and 10 columns, of which 2 are floats, 6 are integers and 2
are objects. Upon initial exploration of the data set, most of the
variables in the survey data can be used directly as predictor variables, but certain
variables can be engineered for more effective predictions. An ethical
consideration at this point is the possible bias in the
recorded data, which must be kept in mind during the analysis and while interpreting and
presenting the results to ensure fairness and accuracy.
The descriptive statistics of the dataset are shown above. Based on these, the mean satisfaction level is about 0.61, average monthly hours range from 96 to 310 with a mean of about 201, and about 23.8% of the records belong to employees who left.
In this step, the HR_capstone_dataset.csv dataset is
cleaned by addressing missing values, removing redundant or
duplicate entries, and identifying any anomalies or inconsistencies.
Outliers that could potentially distort model performance are also
detected and evaluated for appropriate handling. These steps ensure
that the dataset is accurate, consistent, and ready for further
analysis, laying a solid foundation for building reliable predictive
models.
Rename columns
As a data cleaning step, the columns are renamed as needed. This means
standardizing the column names so that they are all in snake_case,
correcting any column names that are misspelled, and making column
names more concise where needed.
## Index(['satisfaction_level', 'last_evaluation', 'number_project',
## 'average_montly_hours', 'time_spend_company', 'Work_accident', 'left',
## 'promotion_last_5years', 'Department', 'salary'],
## dtype='object')
# Rename columns as needed
df = df0.copy()
df = df0.rename(columns={'satisfaction_level':'satisfaction',
'last_evaluation':'last_eval',
'number_project':'#_projects',
'average_montly_hours':'avg_mon_hrs',
'time_spend_company':'tenure',
'Work_accident':'work_accident',
'promotion_last_5years':'promotion_<5yrs',
'Department':'department'
})
# Display all column names after the update
df.columns
## Index(['satisfaction', 'last_eval', '#_projects', 'avg_mon_hrs', 'tenure',
## 'work_accident', 'left', 'promotion_<5yrs', 'department', 'salary'],
## dtype='object')
Check missing values
Checking for any missing values in the data. There appear to be no missing values in this dataset.
## satisfaction 0
## last_eval 0
## #_projects 0
## avg_mon_hrs 0
## tenure 0
## work_accident 0
## left 0
## promotion_<5yrs 0
## department 0
## salary 0
## dtype: int64
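The code that produced the counts above is not shown in this report. As a minimal sketch only, one common way to obtain such a per-column table is `isna().sum()`; the small frame `demo` below is hypothetical and stands in for the renamed HR dataframe:

```python
import pandas as pd

# Hypothetical three-row frame standing in for the renamed HR dataframe
demo = pd.DataFrame({
    'satisfaction': [0.38, 0.80, 0.11],
    'tenure': [3, 6, 4],
    'salary': ['low', 'medium', None],  # one deliberately missing value
})

# Count the missing values in each column
missing_per_column = demo.isna().sum()
print(missing_per_column)
```

A column with a non-zero count would need to be imputed or dropped before modelling; in the HR data every count is zero, so no such step is required.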
Check duplicates
Checking for any duplicate entries in the data. The check finds 3,008 duplicated rows. Because several of the 10 columns are continuous variables, it is very unlikely that two different employees would report identical values in every column, so these observations are most likely true duplicates. Dropping them therefore helps in making accurate predictions.
## np.int64(3008)
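The call behind this count is not shown here. A plausible sketch, using a hypothetical frame `demo` rather than the HR data, is to sum the boolean mask returned by `duplicated()`:

```python
import pandas as pd

# Hypothetical frame in which the third row repeats the first one
demo = pd.DataFrame({
    'satisfaction': [0.46, 0.41, 0.46],
    'last_eval': [0.57, 0.46, 0.57],
    'tenure': [3, 3, 3],
})

# duplicated() marks every repeat of an earlier row as True
n_duplicates = demo.duplicated().sum()

# Keep only the first occurrence of each row
deduplicated = demo.drop_duplicates(keep='first')
print(n_duplicates, len(deduplicated))
```

Note that `duplicated()` only flags the repeats, not the first occurrence, so the count equals the number of rows that `drop_duplicates(keep='first')` removes.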
# Inspect some rows containing duplicates as needed
print(tabulate(df[df.duplicated()].head(), headers='keys', tablefmt='simple'))
##         satisfaction    last_eval    #_projects    avg_mon_hrs    tenure    work_accident    left    promotion_<5yrs  department    salary
## ----  --------------  -----------  ------------  -------------  --------  ---------------  ------  -----------------  ------------  --------
##  396            0.46         0.57             2            139         3                0       1                  0  sales         low
##  866            0.41         0.46             2            128         3                0       1                  0  accounting    low
## 1317            0.37         0.51             2            127         3                0       1                  0  sales         medium
## 1368            0.41         0.52             2            132         3                0       1                  0  RandD         low
## 1461            0.42         0.53             2            142         3                0       1                  0  sales         low
# Drop duplicates and save resulting dataframe in a new variable as needed
df1 = df.drop_duplicates(keep='first')
# Display first few rows of new dataframe as needed
print(tabulate(df1.head(), headers='keys', tablefmt='simple'))
##       satisfaction    last_eval    #_projects    avg_mon_hrs    tenure    work_accident    left    promotion_<5yrs  department    salary
## --  --------------  -----------  ------------  -------------  --------  ---------------  ------  -----------------  ------------  --------
##  0            0.38         0.53             2            157         3                0       1                  0  sales         low
##  1             0.8         0.86             5            262         6                0       1                  0  sales         medium
##  2            0.11         0.88             7            272         4                0       1                  0  sales         medium
##  3            0.72         0.87             5            223         5                0       1                  0  sales         low
##  4            0.37         0.52             2            159         3                0       1                  0  sales         low
Check outliers
Checking for outliers in the data. Certain types of models are more sensitive to outliers than others, so whether to remove outliers depends on the type of model that will be used in the project.
# Create a boxplot to visualize distribution of `tenure` and detect any outliers
plt.figure(figsize=(16,6))
plt.title('Detecting outliers for tenure (Boxplot)', fontsize=15)
plt.xticks(fontsize=8)
plt.yticks(fontsize=8)
# Draw the boxplot (assumed call; the plotting line is missing from this report)
sns.boxplot(x=df1['tenure'])
plt.show()
The box plot shows that there are outliers in the tenure
column. Checking how many rows contain outliers in the
tenure column.
# Determine the number of rows containing outliers
# 25th Percentile for tenure
percentile25 = df1['tenure'].quantile(0.25)
# 75th Percentile for tenure
percentile75 = df1['tenure'].quantile(0.75)
# IQR - Inter Quartile Range
iqr = percentile75 - percentile25
# Limits of the tenure
upper_limit = percentile75 + 1.5 * iqr
lower_limit = percentile25 - 1.5 * iqr
print('Lower limit:', lower_limit)
print('Upper limit:', upper_limit)
## Lower limit: 1.5
## Upper limit: 5.5
# Identifying the outliers in 'tenure'
outliers = df1[(df1['tenure'] > upper_limit) | (df1['tenure'] < lower_limit)]
# print the rows containing the outliers
print('Number of rows containing outliers in tenure:', len(outliers))
## Number of rows containing outliers in tenure: 824
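The limits reported above follow directly from the 1.5 × IQR rule. As a worked sketch, assuming quartiles of 3 and 4 years (the values that reproduce the printed limits of 1.5 and 5.5), with a hypothetical helper `iqr_limits` and made-up tenure values:

```python
def iqr_limits(q1, q3, k=1.5):
    """Return the (lower, upper) outlier limits of the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Assumed quartiles of tenure: Q1 = 3 years, Q3 = 4 years
lower, upper = iqr_limits(3, 4)
print(lower, upper)

# Hypothetical tenure values: only those outside the limits are flagged
tenures = [2, 3, 3, 4, 5, 6, 10]
flagged = [t for t in tenures if t < lower or t > upper]
print(flagged)
```

Because the interquartile range is only one year, every tenure of 6 years or more falls above the upper limit, which explains why a sizeable number of rows is flagged.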
The analysis begins by determining how many employees left and what percentage of all employees this figure represents.
## left
## 0 11428
## 1 3571
## Name: count, dtype: int64
# Get percentages of people who left vs. stayed
print(df['left'].value_counts(normalize=True))
## left
## 0 0.761917
## 1 0.238083
## Name: proportion, dtype: float64
Next, the variables relevant to the project are examined, and plots are created to visualize the relationships between variables in the data.
# Select only numeric columns
numeric_df = df1.select_dtypes(include=['number'])
# Plot a correlation heatmap
plt.figure(figsize=(16, 9))
heatmap = sns.heatmap(numeric_df.corr(), vmin=-1, vmax=1, annot=True, cmap=sns.color_palette("vlag", as_cmap=True))
heatmap.set_title('Correlation Heatmap', fontdict={'fontsize':14}, pad=12);
plt.show()
Correlation heatmap
# Plots to analyse tenure vs satisfaction; tenure vs left distribution
# Set figure and axes
fig, ax = plt.subplots(1, 2, figsize = (18,6))
# Tenure vs left distribution
tenure_stay = df1[df1['left']==0]['tenure']
tenure_left = df1[df1['left']==1]['tenure']
sns.histplot(data=df1, x='tenure', hue='left', multiple='dodge', shrink=5, ax=ax[0])
ax[0].set_title('Tenure distribution classified by employee who left', fontsize=12)
# Tenure vs Satisfaction
sns.boxplot(data=df1, x='satisfaction', y='tenure', hue='left', orient="h", saturation=0.75, ax=ax[1])
ax[1].legend(loc='upper left', title='Left')
ax[1].invert_yaxis()
ax[1].set_title('Satisfaction vs Tenure', fontsize=12)
plt.show()
Box plot
Histogram plot
The histogram shows that only a few employees stay more than 5 years, which might be due to promotions to higher ranks in the company.
# plot for #_project vs avg_mon_hrs; distribution of #_projects
fig, ax = plt.subplots(1, 2, figsize = (18,6))
# distribution of #_projects
projects_stay = df1[df1['left']==0]['#_projects']
projects_left = df1[df1['left']==1]['#_projects']
sns.histplot(data=df1, x='#_projects', hue='left', multiple='dodge', shrink=5, ax=ax[0])
ax[0].set_title('No of projects distribution classified by employee who left', fontsize=12)
# #_project vs avg_mon_hrs
sns.boxplot(data=df1, x='avg_mon_hrs', y='#_projects', hue='left', orient="h",saturation=0.75, ax=ax[1])
ax[1].legend(loc='upper left', title='Left')
ax[1].invert_yaxis()
ax[1].set_title('Average monthly hours by No of project', fontsize=12)
plt.show()
Based on the plots,
Histogram
Box plots
Employees who left the company,
# Plots for satisfaction vs salary; satisfaction vs avg_mon_hrs
fig, ax = plt.subplots(1, 2, figsize = (18,6))
# plot for satisfaction vs salary
sns.boxplot(data=df1, x='satisfaction', y='salary', hue='left',
orient="h", saturation=0.75, ax=ax[0])
ax[0].invert_yaxis()
ax[0].legend(loc='upper left', title='Left')
ax[0].set_title('Satisfaction vs Salary', fontsize=12)
# Plot for satisfaction vs avg_mon_hrs
sns.scatterplot(data=df1, x='satisfaction', y='avg_mon_hrs', hue='left', alpha=0.4, ax=ax[1])
ax[1].set_title('Satisfaction level by average monthly work hours', fontsize=14)
plt.show()
Based on the plots,
Box plot
Salary is strongly related to the satisfaction level. At the low and medium salary levels, satisfaction scores are very low and a high number of employees left the company.
Scatter plot
Employees who worked very long hours report very low satisfaction. A second group of leavers, with a satisfaction level below 0.5, worked fewer hours, possibly because they had been fired or had already given notice to leave the company. This agrees with the previous box plots.
# Plot for avg_mon_hrs vs last_eval; avg_mon_hrs vs promotion_<5yrs
fig, ax = plt.subplots(1, 2, figsize = (18,6))
# Plot for avg_mon_hrs vs promotion_<5yrs
sns.scatterplot(data=df1, x='avg_mon_hrs', y='promotion_<5yrs', hue='left', ax=ax[0])
ax[0].set_title('Average monthly hours by promotion in the last 5 years', fontsize=12)
# Plot for avg_mon_hrs vs last_eval
sns.scatterplot(data=df1, x='avg_mon_hrs', y='last_eval', hue='left', alpha=0.4, ax=ax[1])
ax[1].set_title('Average monthly hours by evaluation score', fontsize=14)
plt.show()
Based on the plots,
avg_mon_hrs vs promotion_<5yrs
avg_mon_hrs vs last_eval
Employees who left,
# Plot for distribution of employee who left by department
plt.figure(figsize=(11,8))
sns.histplot(data=df1, x='department', hue='left', discrete=1,
hue_order=[0, 1], multiple='dodge', shrink=.5)
plt.title('Employees distribution classified by department', fontsize=12)
plt.show()
Sales, technical and support are the three departments with the most employees who left.
Key drivers of employees leaving:
Most of the employees who left appear to be burned out from working long hours on a large number of projects without receiving benefits such as a promotion or a higher salary. This points to poor company management and to company policies that may need to be investigated further.
# paCe: Construct Stage
- Determine which models are most appropriate
- Construct the model
- Confirm model assumptions
- Evaluate model results to determine how well your model fits the
data
## Recall model assumptions
**Logistic Regression model assumptions**
- Outcome variable is categorical
- Observations are independent of each other
- No severe multicollinearity among X variables
- No extreme outliers
- Linear relationship between each X variable and the logit of the outcome variable
- Sufficiently large sample size
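The multicollinearity assumption can be screened with pairwise correlations before fitting. The following is only a sketch on hypothetical data: the frame `demo`, the column `hours_copy` and the 0.8 cut-off are illustrative assumptions, not values taken from this project:

```python
import pandas as pd

# Hypothetical predictors; `hours_copy` is built to be almost collinear with `hours`
demo = pd.DataFrame({
    'hours': [150, 160, 200, 240, 270, 300],
    'hours_copy': [151, 159, 202, 239, 271, 299],
    'projects': [4, 2, 5, 3, 6, 2],
})

# Absolute pairwise Pearson correlations between the predictors
corr = demo.corr().abs()
threshold = 0.8  # assumed cut-off for "severe" correlation

# Collect each pair of distinct columns whose correlation exceeds the cut-off
high_pairs = [(a, b)
              for i, a in enumerate(corr.columns)
              for b in corr.columns[i + 1:]
              if corr.loc[a, b] > threshold]
print(high_pairs)
```

Pairs reported this way are candidates for dropping or combining one of the two variables before a logistic regression is fitted.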
# Model Building, Training and Predictions
# Results and Evaluation
- Fit a model that predicts the outcome variable using two or more
independent variables
- Check model assumptions
- Evaluate the model
### Identify the type of prediction task.
**Objective -** to analyse whether or not the employee leaves the
company
### Identify the types of models most appropriate for this task.
The dependent variable has categorical values (0 and 1), so this is a
binary classification task.
Models to use:
- Logistic regression
- Tree-based ML model
### Modelling
### Logistic Regression
Binomial logistic regression suits this objective.
**Steps to take for the model**
- Categorical variables, i.e. department and salary, must be encoded as
numeric values
- Department is a nominal category, which can be encoded with dummy
variables
- Salary is a hierarchical (ordinal) category, which should be encoded with
ordinal values (0 - low, 1 - medium, 2 - high)
# Encoding the categorical into numerical
# Copy the dataframe for the modelling
enc_df = df1.copy()
# Mapping the salary category with ordinal numbers according to hierarchy
salary_map = {'low':0, 'medium':1, 'high':2}
# Creating a new column for the salary map
enc_df['salary'] = enc_df['salary'].map(salary_map)
# Encoding the department with dummy variables
enc_df = pd.get_dummies(enc_df, drop_first=False)
enc_df.head()
## satisfaction last_eval #_projects avg_mon_hrs tenure work_accident \
## 0 0.38 0.53 2 157 3 0
## 1 0.80 0.86 5 262 6 0
## 2 0.11 0.88 7 272 4 0
## 3 0.72 0.87 5 223 5 0
## 4 0.37 0.52 2 159 3 0
##
## left promotion_<5yrs salary department_IT department_RandD \
## 0 1 0 0 False False
## 1 1 0 1 False False
## 2 1 0 1 False False
## 3 1 0 0 False False
## 4 1 0 0 False False
##
## department_accounting department_hr department_management \
## 0 False False False
## 1 False False False
## 2 False False False
## 3 False False False
## 4 False False False
##
## department_marketing department_product_mng department_sales \
## 0 False False True
## 1 False False True
## 2 False False True
## 3 False False True
## 4 False False True
##
## department_support department_technical
## 0 False False
## 1 False False
## 2 False False
## 3 False False
## 4 False False
# Removing the outliers in the tenure and saving it in a new dataframe
df_lr = enc_df[(enc_df['tenure'] >= lower_limit) & (enc_df['tenure'] <= upper_limit)]
df_lr.head().reset_index(drop=True)
## satisfaction last_eval #_projects avg_mon_hrs tenure work_accident \
## 0 0.38 0.53 2 157 3 0
## 1 0.11 0.88 7 272 4 0
## 2 0.72 0.87 5 223 5 0
## 3 0.37 0.52 2 159 3 0
## 4 0.41 0.50 2 153 3 0
##
## left promotion_<5yrs salary department_IT department_RandD \
## 0 1 0 0 False False
## 1 1 0 1 False False
## 2 1 0 0 False False
## 3 1 0 0 False False
## 4 1 0 0 False False
##
## department_accounting department_hr department_management \
## 0 False False False
## 1 False False False
## 2 False False False
## 3 False False False
## 4 False False False
##
## department_marketing department_product_mng department_sales \
## 0 False False True
## 1 False False True
## 2 False False True
## 3 False False True
## 4 False False True
##
## department_support department_technical
## 0 False False
## 1 False False
## 2 False False
## 3 False False
## 4 False False
## (11167, 19)
# Setting the 'y' variable
y = df_lr['left']
# Setting the 'x' variable with dropping the left column
X = df_lr.drop('left', axis=1)
# Split the data into training (75%) and test (25%) dataset
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.25, stratify=y, random_state=0)
# Constructing the LogReg model
log_clf = LogisticRegression(random_state=0, max_iter=500)
# Fitting the model
log_clf.fit(X_train,y_train)
## LogisticRegression(max_iter=500, random_state=0)
# Use the model for the test dataset
y_pred = log_clf.predict(X_test)
# Constructing a confusion matrix
# Computing values in the matrix
log_cm = confusion_matrix(y_test, y_pred, labels=log_clf.classes_)
# Create display of confusion matrix
log_disp = ConfusionMatrixDisplay(confusion_matrix = log_cm,
display_labels = log_clf.classes_)
# Plot confusion matrix
log_disp.plot(values_format='')
plt.show()
Model accurately predicts,
Checking the class imbalance
## left
## 0 0.831468
## 1 0.168532
## Name: proportion, dtype: float64
The data shows an 83%-17% split, so the classes are imbalanced. Depending on the model performance, the data may need to be resampled to obtain a more balanced split.
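A common alternative to resampling is to weight the classes inversely to their frequency. As a rough sketch (not something this report applies), the "balanced" heuristic behind scikit-learn's `class_weight='balanced'` option is `n_samples / (n_classes * class_count)`. The helper `balanced_class_weights` and the counts below are hypothetical and merely mirror the 83%/17% split:

```python
def balanced_class_weights(counts):
    """Weights following the n_samples / (n_classes * class_count) heuristic."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}

# Hypothetical class counts mirroring the 83% / 17% split (0 = stayed, 1 = left)
weights = balanced_class_weights({0: 83, 1: 17})
print(weights)
```

The minority class receives the larger weight, so misclassifying an employee who left costs the model more than misclassifying one who stayed.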
# Create classification report for logistic regression model
row_names = ['Predicted would not leave', 'Predicted would leave']
print(classification_report(y_test, y_pred, target_names=row_names))
##                            precision    recall  f1-score   support
##
## Predicted would not leave       0.86      0.94      0.90      2321
##     Predicted would leave       0.47      0.24      0.32       471
##
##                  accuracy                           0.83      2792
##                 macro avg       0.66      0.59      0.61      2792
##              weighted avg       0.79      0.83      0.80      2792
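The classification report shows that, for employees who left, the model achieves a precision of 0.47, a recall of 0.24 and an F1-score of 0.32, against an overall accuracy of 0.83.
These scores are very low for the class that matters most to this project, namely the employees who will leave. Hence, other classification models, Decision Tree and Random Forest, are tried next.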
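The F1-score is the harmonic mean of precision and recall, which is why the low recall drags it down. A short check using the rounded precision and recall values from the classification report (the helper name `f1_from` is hypothetical):

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rounded values reported for the class of employees who left
f1_leave = f1_from(0.47, 0.24)
# Rounded values reported for the class of employees who stayed
f1_stay = f1_from(0.86, 0.94)
print(round(f1_leave, 2), round(f1_stay, 2))
```

Both results agree with the F1 column of the report, and they show that the 0.83 accuracy is carried almost entirely by the majority class.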
# Using the enc_df dataframe
# Setting the y variable
y = enc_df['left']
# Setting the X variable
X = enc_df.drop('left',axis=1)
# Split the data into training (75%) and test (25%) dataset
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.25, stratify=y, random_state=0)
# Instantiate the decision tree model
tree = DecisionTreeClassifier(random_state=0)
# Assign a dictionary of hyperparameters to search over
cv_params = {'max_depth':[2, 4, 6, None],
             'min_samples_leaf': [2, 6, 3],
             'min_samples_split': [2, 5, 7]
             }
# Assign a dictionary of scoring metrics to capture
scoring = ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']
# Instantiate GridSearch
dtree1 = GridSearchCV(tree, cv_params, scoring=scoring, cv=4, refit='roc_auc')
# Fit the grid search on the training data
dtree1.fit(X_train, y_train)
# Best estimator found by the grid search
dtree1.best_estimator_
## DecisionTreeClassifier(max_depth=4, min_samples_leaf=6, random_state=0)
# Best hyperparameters
dtree1.best_params_
## {'max_depth': 4, 'min_samples_leaf': 6, 'min_samples_split': 2}
# Best mean cross-validated AUC score
dtree1.best_score_
## np.float64(0.9698667651120891)
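To get a feel for the cost of this search, the number of candidate models can be counted from the grid itself. The sketch below restates the decision tree grid and the 4-fold setting from the code above; it only counts combinations and does not train anything:

```python
from itertools import product

# Hyperparameter grid and fold count restated from the decision tree search
cv_params = {'max_depth': [2, 4, 6, None],
             'min_samples_leaf': [2, 6, 3],
             'min_samples_split': [2, 5, 7]}
n_folds = 4

# Every combination of one value per hyperparameter is one candidate model
combinations = list(product(*cv_params.values()))
n_candidates = len(combinations)
n_cv_fits = n_candidates * n_folds
print(n_candidates, n_cv_fits)
```

Each candidate is fitted once per fold, and the search then refits the best candidate once more on the full training set because `refit` is set.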
def make_results(model_name:str, model_object, metric:str):
    '''
    Arguments:
        model_name (string): what you want the model to be called in the output table
        model_object: a fit GridSearchCV object
        metric (string): precision, recall, f1, accuracy, or auc
    Returns a pandas df with the F1, recall, precision, accuracy, and auc scores
    for the model with the best mean 'metric' score across all validation folds.
    '''
    # Create dictionary that maps input metric to actual metric name in GridSearchCV
    metric_dict = {'auc': 'mean_test_roc_auc',
                   'precision': 'mean_test_precision',
                   'recall': 'mean_test_recall',
                   'f1': 'mean_test_f1',
                   'accuracy': 'mean_test_accuracy'
                   }
    # Get all the results from the CV and put them in a df
    cv_results = pd.DataFrame(model_object.cv_results_)
    # Isolate the row of the df with the max(metric) score
    best_estimator_results = cv_results.iloc[cv_results[metric_dict[metric]].idxmax(), :]
    # Extract AUC, F1, recall, precision, and accuracy from that row
    auc = best_estimator_results.mean_test_roc_auc
    f1 = best_estimator_results.mean_test_f1
    recall = best_estimator_results.mean_test_recall
    precision = best_estimator_results.mean_test_precision
    accuracy = best_estimator_results.mean_test_accuracy
    # Create table of results
    table = pd.DataFrame({'model': [model_name],
                          'precision': [precision],
                          'recall': [recall],
                          'F1': [f1],
                          'accuracy': [accuracy],
                          'auc': [auc]
                          })
    return table
# Get all CV scores
dtree1_cv_results = make_results('Decision Tree 1 CV', dtree1, 'auc')
dtree1_cv_results
##                 model  precision    recall        F1  accuracy       auc
## 0  Decision Tree 1 CV    0.91449  0.916279  0.915345  0.971867  0.969867
The metric scores are very high, so the model performs very well. However, a single decision tree is prone to overfitting, so a Random Forest model is trained for comparison.
# Instantiate model
rf = RandomForestClassifier(random_state=0)
# Assign a dictionary of hyperparameters to search over
cv_params = {'max_depth': [3, None],
'max_features': [1.0],
'max_samples': [0.7, 1.0],
'min_samples_leaf': [1,2,3],
'min_samples_split': [2,3],
'n_estimators': [100]
}
# Assign a dictionary of scoring metrics to capture
scoring = ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']
# Instantiate GridSearch
rf1 = GridSearchCV(rf, cv_params, scoring=scoring, cv=4, refit='roc_auc', n_jobs=-1)
# Fit the grid search on the training data
rf1.fit(X_train, y_train)
# Best estimator found by the grid search
rf1.best_estimator_
## RandomForestClassifier(max_depth=3, max_features=1.0, max_samples=0.7,
##                        min_samples_leaf=3, random_state=0)
# Best hyperparameters
rf1.best_params_
## {'max_depth': 3, 'max_features': 1.0, 'max_samples': 0.7, 'min_samples_leaf': 3, 'min_samples_split': 2, 'n_estimators': 100}
# Best mean cross-validated AUC score
rf1.best_score_
## np.float64(0.9741727239274944)
# Get all CV scores
rf1_cv_results = make_results('Random Forest 1 CV', rf1, 'auc')
results = pd.concat([rf1_cv_results,dtree1_cv_results], axis=0)
results
## model precision recall F1 accuracy auc
## 0 Random Forest 1 CV 0.836573 0.916277 0.874469 0.956299 0.974173
## 0 Decision Tree 1 CV 0.914490 0.916279 0.915345 0.971867 0.969867
| | model | precision | recall | F1 | accuracy | auc |
|---|---|---|---|---|---|---|
| 0 | Random Forest 1 CV | 0.950023 | 0.915614 | 0.932467 | 0.977983 | 0.980425 |
| 0 | Decision Tree 1 CV | 0.914490 | 0.916279 | 0.915345 | 0.971867 | 0.969867 |
Based on the model training results:
- The random forest model scores higher than the decision tree on every metric except recall, where it is lower by only about 0.001, which is not a significant difference.
- Because the random forest performs better than the decision tree overall, it is the model evaluated on the test set.
def get_scores(model_name: str, model, X_test_data, y_test_data):
    '''
    Generate a table of test scores.

    In:
        model_name (string): How you want your model to be named in the output table
        model: A fit GridSearchCV object
        X_test_data: numpy array of X_test data
        y_test_data: numpy array of y_test data

    Out: pandas df of precision, recall, f1, accuracy, and AUC scores for your model
    '''
    # Hard class predictions (0/1) from the best estimator found by the grid search
    preds = model.best_estimator_.predict(X_test_data)

    # Note: AUC is computed here from hard class predictions, not predicted probabilities
    auc = roc_auc_score(y_test_data, preds)
    accuracy = accuracy_score(y_test_data, preds)
    precision = precision_score(y_test_data, preds)
    recall = recall_score(y_test_data, preds)
    f1 = f1_score(y_test_data, preds)

    table = pd.DataFrame({'model': [model_name],
                          'precision': [precision],
                          'recall': [recall],
                          'f1': [f1],
                          'accuracy': [accuracy],
                          'AUC': [auc]
                          })
    return table
# Get predictions on test data
rf1_test_scores = get_scores('Random Forest 1 Test', rf1, X_test, y_test)
rf1_test_scores
## model precision recall f1 accuracy AUC
## 0 Random Forest 1 Test 0.855535 0.915663 0.884578 0.960307 0.942431
| | model | precision | recall | f1 | accuracy | AUC |
|---|---|---|---|---|---|---|
| 0 | Random Forest 1 Test | 0.964211 | 0.919679 | 0.941418 | 0.980987 | 0.956439 |
The test results are similar to the training (cross-validation) results, which suggests that the model generalizes well. Because the test data was used only for this model, these scores should be a reasonable estimate of how the model will perform on new, unseen data.
The Round 1 models used all of the variables as features. For the Round 2 models, feature engineering is used to adapt the data with the aim of improving the model.
What can be engineered in the dataset:
- Satisfaction level may not be reported for all employees, so dropping it is one option.
- Average monthly hours may be a source of data leakage, because it might be recorded after an employee has given notice to resign or the company has given the employee notice to leave. Engineering this variable into a new variable, overworked, might help improve the model's predictions.
# Drop the `satisfaction` column and save resulting dataframe in new variable
df2 = enc_df.drop('satisfaction', axis=1)
# Display first few rows of new dataframe
df2.head()
## last_eval #_projects avg_mon_hrs tenure work_accident left \
## 0 0.53 2 157 3 0 1
## 1 0.86 5 262 6 0 1
## 2 0.88 7 272 4 0 1
## 3 0.87 5 223 5 0 1
## 4 0.52 2 159 3 0 1
##
## promotion_<5yrs salary department_IT department_RandD \
## 0 0 0 False False
## 1 0 1 False False
## 2 0 1 False False
## 3 0 0 False False
## 4 0 0 False False
##
## department_accounting department_hr department_management \
## 0 False False False
## 1 False False False
## 2 False False False
## 3 False False False
## 4 False False False
##
## department_marketing department_product_mng department_sales \
## 0 False False True
## 1 False False True
## 2 False False True
## 3 False False True
## 4 False False True
##
## department_support department_technical
## 0 False False
## 1 False False
## 2 False False
## 3 False False
## 4 False False
# Create `overworked` column. For now, it's identical to average monthly hours.
df2['overworked'] = df2['avg_mon_hrs']
# Inspect max and min average monthly hours values
print('Max hours:', df2['overworked'].max())
print('Min hours:', df2['overworked'].min())
## Max hours: 310
## Min hours: 96
# Define `overworked` as working > 175 hrs/month
df2['overworked'] = (df2['overworked'] > 175).astype(int)
# Display first few rows of new column
df2['overworked'].head()
## 0 0
## 1 1
## 2 1
## 3 1
## 4 0
## Name: overworked, dtype: int64
Assuming a 40-hour work week and two weeks of vacation per year, the average working hours per month are 40 hours × 50 weeks / 12 months = 166.67 hours.
Overworked is therefore defined as working more than 175 hours per month on average.
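The threshold arithmetic above can be checked in a few lines. This is only a sketch: the constant names and the helper `is_overworked` are hypothetical, introduced for illustration, while the 40-hour week, the 50 working weeks, and the 175-hour cutoff come from the text.

```python
# Assumptions taken from the text: 40 h/week and 50 working weeks per year
HOURS_PER_WEEK = 40
WORKING_WEEKS_PER_YEAR = 50
MONTHS_PER_YEAR = 12

baseline_monthly_hours = HOURS_PER_WEEK * WORKING_WEEKS_PER_YEAR / MONTHS_PER_YEAR
print(round(baseline_monthly_hours, 2))  # 166.67

OVERWORKED_THRESHOLD = 175  # hours per month


def is_overworked(avg_monthly_hours, threshold=OVERWORKED_THRESHOLD):
    """Return 1 if average monthly hours exceed the threshold, else 0."""
    return int(avg_monthly_hours > threshold)


# First five avg_mon_hrs values shown in the df2.head() output
print([is_overworked(h) for h in [157, 262, 272, 223, 159]])  # [0, 1, 1, 1, 0]
```

The comparison is strict, so an employee averaging exactly 175 hours per month is not flagged as overworked.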
# Drop the `avg_mon_hrs` column
df2 = df2.drop('avg_mon_hrs', axis=1)
# Display first few rows of resulting dataframe
df2.head()
## last_eval #_projects tenure work_accident left promotion_<5yrs \
## 0 0.53 2 3 0 1 0
## 1 0.86 5 6 0 1 0
## 2 0.88 7 4 0 1 0
## 3 0.87 5 5 0 1 0
## 4 0.52 2 3 0 1 0
##
## salary department_IT department_RandD department_accounting \
## 0 0 False False False
## 1 1 False False False
## 2 1 False False False
## 3 0 False False False
## 4 0 False False False
##
## department_hr department_management department_marketing \
## 0 False False False
## 1 False False False
## 2 False False False
## 3 False False False
## 4 False False False
##
## department_product_mng department_sales department_support \
## 0 False True False
## 1 False True False
## 2 False True False
## 3 False True False
## 4 False True False
##
## department_technical overworked
## 0 False 0
## 1 False 1
## 2 False 1
## 3 False 1
## 4 False 0
# Isolate the outcome variable
y = df2['left']
# Select the features
X = df2.drop('left', axis=1)
# Create test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, stratify=y, random_state=0)
# Instantiate model
tree = DecisionTreeClassifier(random_state=0)
# Assign a dictionary of hyperparameters to search over
cv_params = {'max_depth':[4, 6, 8, None],
'min_samples_leaf': [2, 5, 1],
'min_samples_split': [2, 4, 6]
}
# Assign a dictionary of scoring metrics to capture
scoring = ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']
# Instantiate GridSearch
dtree2 = GridSearchCV(tree, cv_params, scoring=scoring, cv=4, refit='roc_auc')
## {'max_depth': 6, 'min_samples_leaf': 2, 'min_samples_split': 6}
## np.float64(0.9586752505340426)
# Get all CV scores
dtree2_cv_results = make_results('Decision Tree 2 CV', dtree2, 'auc')
results = pd.concat([dtree1_cv_results,dtree2_cv_results,rf1_cv_results], axis=0)
results
## model precision recall F1 accuracy auc
## 0 Decision Tree 1 CV 0.914490 0.916279 0.915345 0.971867 0.969867
## 0 Decision Tree 2 CV 0.856693 0.903553 0.878882 0.958523 0.958675
## 0 Random Forest 1 CV 0.836573 0.916277 0.874469 0.956299 0.974173
# Instantiate model
rf = RandomForestClassifier(random_state=0)
# Assign a dictionary of hyperparameters to search over
cv_params = {'max_depth': [3, None],
'max_features': [1.0],
'max_samples': [0.7, 1.0],
'min_samples_leaf': [1,2,3],
'min_samples_split': [2,3],
'n_estimators': [100]
}
# Assign a dictionary of scoring metrics to capture
scoring = ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']
# Instantiate GridSearch
rf2 = GridSearchCV(rf, cv_params, scoring=scoring, cv=4, refit='roc_auc', n_jobs=-1)
## {'max_depth': None, 'max_features': 1.0, 'max_samples': 1.0, 'min_samples_leaf': 3, 'min_samples_split': 2, 'n_estimators': 100}
## np.float64(0.960396763726207)
0.9648100662833985
# Get all CV scores
rf2_cv_results = make_results('Random Forest 2 CV', rf2, 'auc')
results = pd.concat([dtree1_cv_results,dtree2_cv_results,rf1_cv_results,rf2_cv_results], axis=0)
results
## model precision recall F1 accuracy auc
## 0 Decision Tree 1 CV 0.914490 0.916279 0.915345 0.971867 0.969867
## 0 Decision Tree 2 CV 0.856693 0.903553 0.878882 0.958523 0.958675
## 0 Random Forest 1 CV 0.836573 0.916277 0.874469 0.956299 0.974173
## 0 Random Forest 2 CV 0.912536 0.880106 0.895991 0.966085 0.960397
| | model | precision | recall | F1 | accuracy | auc |
|---|---|---|---|---|---|---|
| 0 | Decision Tree 1 CV | 0.914490 | 0.916279 | 0.915345 | 0.971867 | 0.969867 |
| 0 | Decision Tree 2 CV | 0.856693 | 0.903553 | 0.878882 | 0.958523 | 0.958675 |
| 0 | Random Forest 1 CV | 0.950023 | 0.915614 | 0.932467 | 0.977983 | 0.980425 |
| 0 | Random Forest 2 CV | 0.866758 | 0.878754 | 0.872407 | 0.957411 | 0.964810 |
Based on the training results for the two rounds of Decision Tree and Random Forest Model,
# Get predictions on test data
rf2_test_scores = get_scores('Random Forest 2 Test', rf2, X_test, y_test)
test_results = pd.concat([rf1_test_scores, rf2_test_scores], axis=0)
test_results
## model precision recall f1 accuracy AUC
## 0 Random Forest 1 Test 0.855535 0.915663 0.884578 0.960307 0.942431
## 0 Random Forest 2 Test 0.901010 0.895582 0.898288 0.966311 0.937991
| | model | precision | recall | f1 | accuracy | AUC |
|---|---|---|---|---|---|---|
| 0 | Random Forest 1 Test | 0.964211 | 0.919679 | 0.941418 | 0.980987 | 0.956439 |
| 0 | Random Forest 2 Test | 0.870406 | 0.903614 | 0.886700 | 0.961641 | 0.938407 |
A confusion matrix is plotted to visualize the model's predictions on the test set.
# Generate array of values for confusion matrix
preds = rf2.best_estimator_.predict(X_test)
cm = confusion_matrix(y_test, preds, labels=rf2.classes_)
# Plot confusion matrix
disp = ConfusionMatrixDisplay(confusion_matrix=cm,
display_labels=rf2.classes_)
disp.plot(values_format='');
A perfect model would yield all true negatives and true positives, and no false negatives or false positives.
In this case, the model predicts more false positives than false negatives, which means that some employees may be identified as at risk of being fired or leaving voluntarily when that is not actually the case. Even so, it is still a strong model for predicting which employees stay.
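As a reading aid for the confusion matrix, the sketch below unpacks the four cells and derives precision and recall from them. The counts in `cm_demo` are hypothetical values chosen for illustration, not the project's results; only the layout (true labels on rows, predicted labels on columns) follows scikit-learn's `confusion_matrix` convention.

```python
# Hypothetical counts for illustration only (NOT the project's results).
# scikit-learn's confusion_matrix puts true labels on rows and predicted
# labels on columns, so for labels [0, 1] the layout is
# [[TN, FP],
#  [FN, TP]]
cm_demo = [[90, 10],
           [5, 45]]

(tn, fp), (fn, tp) = cm_demo

# Precision: share of predicted leavers who actually left
precision = tp / (tp + fp)
# Recall: share of actual leavers the model caught
recall = tp / (tp + fn)

print(f"false positives={fp}, false negatives={fn}")
print(f"precision={precision:.3f}, recall={recall:.3f}")
```

When false positives outnumber false negatives, as in this made-up example, precision comes out lower than recall.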
# Plot the tree
plt.figure(figsize=(85,20))
plot_tree(dtree2.best_estimator_, max_depth=6, fontsize=14, feature_names=list(X.columns),
          class_names=['stayed', 'left'], filled=True);
plt.show()
# Feature importances
dtree2_importances = pd.DataFrame(dtree2.best_estimator_.feature_importances_,
columns=['gini_importance'],
index=X.columns
)
dtree2_importances = dtree2_importances.sort_values(by='gini_importance', ascending=False)
# Only extract the features with importances > 0
dtree2_importances = dtree2_importances[dtree2_importances['gini_importance'] != 0]
dtree2_importances
## gini_importance
## last_eval 0.343958
## #_projects 0.343385
## tenure 0.215681
## overworked 0.093498
## department_support 0.001142
## salary 0.000910
## department_sales 0.000607
## department_technical 0.000418
## work_accident 0.000183
## department_IT 0.000139
## department_marketing 0.000078
sns.barplot(data=dtree2_importances, x="gini_importance", y=dtree2_importances.index, orient='h')
plt.title("Decision Tree: Feature Importances for Employee Leaving", fontsize=12)
plt.ylabel("Feature")
plt.xlabel("Importance")
plt.show()
The feature importance plot for the decision tree model shows that last_eval, #_projects, tenure, and overworked are, in descending order, the most important features for predicting the outcome variable, left.
Now, plot the feature importances for the random forest model.
# Get feature importances
feat_impt = rf2.best_estimator_.feature_importances_
# Get indices of top 10 features
ind = np.argpartition(rf2.best_estimator_.feature_importances_, -10)[-10:]
# Get column labels of top 10 features
feat = X.columns[ind]
# Filter `feat_impt` to consist of top 10 feature importances
feat_impt = feat_impt[ind]
y_df = pd.DataFrame({"Feature":feat,"Importance":feat_impt})
y_sort_df = y_df.sort_values("Importance")
fig = plt.figure()
ax1 = fig.add_subplot(111)
y_sort_df.plot(kind='barh',ax=ax1,x="Feature",y="Importance")
ax1.set_title("Random Forest: Important variables that have an impact in employees leaving", fontsize=12)
ax1.set_ylabel("Feature")
ax1.set_xlabel("Importance")
plt.show()
The feature importance plot for the random forest model matches the one for the decision tree model.
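The top-10 selection in the code above relies on `np.argpartition`, which is worth a short illustration because it does not return the indices in sorted order. The sketch below uses a hypothetical array of six importances and `k = 3`; the values and the names `importances`, `top_idx`, and `top_idx_sorted` are assumptions made for this example.

```python
import numpy as np

# Hypothetical importances for six features (illustrative values only)
importances = np.array([0.05, 0.40, 0.10, 0.30, 0.02, 0.13])
k = 3

# argpartition guarantees that the k largest values occupy the last k
# positions, but it does not sort them among themselves
top_idx = np.argpartition(importances, -k)[-k:]

# Order those k indices by importance, largest first
top_idx_sorted = top_idx[np.argsort(importances[top_idx])[::-1]]
print(top_idx_sorted)  # [1 3 5]
```

This is why the plotting code sorts the resulting DataFrame by importance before drawing the bars.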
pacE: Execute Stage
- Interpret model performance and results
- Share actionable steps with stakeholders
Recall evaluation metrics
- **AUC** is the area under the ROC curve; it's also considered the
probability that the model ranks a random positive example more
highly than a random negative example.
- **Precision** measures the proportion of data points predicted as
True that are actually True, in other words, the proportion of
positive predictions that are true positives.
- **Recall** measures the proportion of data points that are predicted
as True, out of all the data points that are actually True. In other
words, it measures the proportion of positives that are correctly
classified.
- **Accuracy** measures the proportion of data points that are
correctly classified.
- **F1-score** is an aggregation of precision and recall.
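The definitions above can be made concrete with a small from-scratch sketch. The toy labels and scores are hypothetical, and the helpers `classification_metrics` and `pairwise_auc` are illustrative names rather than library functions. AUC is computed directly from its ranking interpretation: the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as one half.

```python
def classification_metrics(y_true, y_pred):
    """Precision, recall, F1, and accuracy from hard 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    accuracy = (tp + tn) / len(y_true)
    return {'precision': precision, 'recall': recall,
            'f1': f1, 'accuracy': accuracy}


def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 3 positives, 5 negatives
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # threshold at 0.5

metrics = classification_metrics(y_true, y_pred)
auc_from_scores = pairwise_auc(y_true, scores)
auc_from_labels = pairwise_auc(y_true, y_pred)

print(metrics)
print(round(auc_from_scores, 4), round(auc_from_labels, 4))
```

On this toy data the AUC computed from continuous scores is higher than the AUC computed from thresholded 0/1 labels. This may be relevant when comparing the tables in this report: scikit-learn's `'roc_auc'` scorer in `GridSearchCV` works from predicted scores, whereas `get_scores` passes hard predictions to `roc_auc_score`, so the cross-validation AUC and the test AUC are not computed in quite the same way.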
Logistic Regression Model
The logistic regression model scores very low on the test set for the project's main objective, which is predicting the employees who will leave.
Tree-based Machine Learning
After feature engineering, the random forest model outperformed the decision tree model on the test set.
Decision Tree model performs with,
Random Forest model performs with,
From the initial assessment, EDA, and visualization, the employees appear to be overworked as a result of poor company management. The models and their feature importances support this finding.
The following recommendations could be presented to the stakeholders to help retain employees:
Next Steps
- Establish a structured method for collecting employees' evaluation and satisfaction scores before they leave the company, since scores recorded close to or after departure may lead to data leakage. This may help mitigate the issue and improve the model's performance.